The following material has been adapted from the 2023 Swiss Institute of Bioinformatics single cell RNA sequencing course (https://sib-swiss.github.io/single-cell-training/) authored by Tania Wyss, Rachel Marcone-Jeitziner, Geert van Geest, and Patricia Palagi
library(Seurat)
library(ggplot2)
library(dplyr)
load("seu_preprocess_dimreduction.RData")
The FindNeighbors function adds a graph to the Seurat object that is used for clustering (FindClusters used to do this automatically in earlier versions, but not any more). Key arguments are dims (the number of PCs) and k.param (the number of neighbours) - default is 20. Default is usually fine but sometimes it helps to have the same number of neighbours as in the RunUMAP function (default for RunUMAP is 30). If clusters are mixed together on the UMAP plot then matching n.neighbors in RunUMAP and k.param in FindNeighbors can sometimes help.
seu <- Seurat::FindNeighbors(seu, dims = 1:25)
## Computing nearest neighbor graph
## Computing SNN
There are many important arguments for FindClusters, particularly the resolution, algorithm and method. The default algorithm in Seurat is the Louvain algorithm. The Leiden algorithm is an improvement on Louvain but it does require the installation of Python and the leidenalg package (see Appendix). We will use Louvain today but it is worth looking into setting up Leiden clustering.
# Run FindClusters at multiple resolutions
seu <- Seurat::FindClusters(seu, resolution = seq(0.1, 0.8, by=0.1))
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 6830
## Number of edges: 270007
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9788
## Number of communities: 7
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 6830
## Number of edges: 270007
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9663
## Number of communities: 12
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 6830
## Number of edges: 270007
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9571
## Number of communities: 14
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 6830
## Number of edges: 270007
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9491
## Number of communities: 16
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 6830
## Number of edges: 270007
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9413
## Number of communities: 17
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 6830
## Number of edges: 270007
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9341
## Number of communities: 18
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 6830
## Number of edges: 270007
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9267
## Number of communities: 19
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 6830
## Number of edges: 270007
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9196
## Number of communities: 21
## Elapsed time: 0 seconds
head(seu[[]])
## orig.ident nCount_RNA nFeature_RNA percent.mito
## PBMMC-1_AAACCTGCAGACGCAA-1 PBMMC-1 2401 909 2.540608
## PBMMC-1_AAACCTGTCATCACCC-1 PBMMC-1 3532 760 5.181200
## PBMMC-1_AAAGATGCATAAAGGT-1 PBMMC-1 3972 1215 4.934542
## PBMMC-1_AAAGCAAAGCAGCGTA-1 PBMMC-1 3569 894 3.250210
## PBMMC-1_AAAGCAACAATAACGA-1 PBMMC-1 2982 730 3.688799
## PBMMC-1_AAAGCAACATCAGTCA-1 PBMMC-1 22284 3108 3.181655
## percent.ribo percent.globin ident sum detected
## PBMMC-1_AAACCTGCAGACGCAA-1 28.65473 0.1665973 PBMMC-1 2401 909
## PBMMC-1_AAACCTGTCATCACCC-1 55.03964 0.1981880 PBMMC-1 3532 760
## PBMMC-1_AAAGATGCATAAAGGT-1 30.43807 0.3776435 PBMMC-1 3972 1215
## PBMMC-1_AAAGCAAAGCAGCGTA-1 55.02942 0.3642477 PBMMC-1 3569 894
## PBMMC-1_AAAGCAACAATAACGA-1 54.49363 0.1006036 PBMMC-1 2982 730
## PBMMC-1_AAAGCAACATCAGTCA-1 23.40693 36.9682283 PBMMC-1 22284 3108
## percent.top_50 percent.top_100 percent.top_200
## PBMMC-1_AAACCTGCAGACGCAA-1 43.27364 56.35152 68.05498
## PBMMC-1_AAACCTGTCATCACCC-1 60.64553 75.79275 83.32390
## PBMMC-1_AAAGATGCATAAAGGT-1 41.69184 56.92346 69.36052
## PBMMC-1_AAAGCAAAGCAGCGTA-1 51.05071 68.75876 78.53741
## PBMMC-1_AAAGCAACAATAACGA-1 53.75587 71.73038 81.52247
## PBMMC-1_AAAGCAACATCAGTCA-1 58.06408 69.09891 76.53025
## percent.top_500 total S.Score G2M.Score Phase
## PBMMC-1_AAACCTGCAGACGCAA-1 82.96543 2401 -0.2630557 -0.5250315 G1
## PBMMC-1_AAACCTGTCATCACCC-1 92.63873 3532 -0.4838513 -0.7964061 G1
## PBMMC-1_AAAGATGCATAAAGGT-1 81.99899 3972 -0.6049204 -1.0335645 G1
## PBMMC-1_AAAGCAAAGCAGCGTA-1 88.96049 3569 -0.5251903 -0.8884195 G1
## PBMMC-1_AAAGCAACAATAACGA-1 92.28706 2982 -0.4301263 -0.6948718 G1
## PBMMC-1_AAAGCAACATCAGTCA-1 83.75067 22284 -0.3034593 -2.3294031 G1
## RNA_snn_res.0.1 RNA_snn_res.0.2 RNA_snn_res.0.3
## PBMMC-1_AAACCTGCAGACGCAA-1 3 9 9
## PBMMC-1_AAACCTGTCATCACCC-1 1 3 1
## PBMMC-1_AAAGATGCATAAAGGT-1 2 2 7
## PBMMC-1_AAAGCAAAGCAGCGTA-1 1 3 1
## PBMMC-1_AAAGCAACAATAACGA-1 1 3 1
## PBMMC-1_AAAGCAACATCAGTCA-1 4 4 2
## RNA_snn_res.0.4 RNA_snn_res.0.5 RNA_snn_res.0.6
## PBMMC-1_AAACCTGCAGACGCAA-1 9 9 9
## PBMMC-1_AAACCTGTCATCACCC-1 12 13 13
## PBMMC-1_AAAGATGCATAAAGGT-1 7 6 15
## PBMMC-1_AAAGCAAAGCAGCGTA-1 12 13 13
## PBMMC-1_AAAGCAACAATAACGA-1 12 13 13
## PBMMC-1_AAAGCAACATCAGTCA-1 1 7 6
## RNA_snn_res.0.7 RNA_snn_res.0.8 seurat_clusters
## PBMMC-1_AAACCTGCAGACGCAA-1 9 15 15
## PBMMC-1_AAACCTGTCATCACCC-1 13 13 13
## PBMMC-1_AAAGATGCATAAAGGT-1 16 17 17
## PBMMC-1_AAAGCAAAGCAGCGTA-1 13 13 13
## PBMMC-1_AAAGCAACAATAACGA-1 13 13 13
## PBMMC-1_AAAGCAACATCAGTCA-1 5 5 5
The clusters are named based on the assay name (RNA), the graph used (snn), and the resolution that the clustering was run at (res.0.1-res.0.8). The default clustering is “seurat_clusters” and is the most recently added clustering. Here we ran clustering with resolution 0.1 to 0.8 so the last cluster was at resolution 0.8. After clustering, the active.ident is now seurat_clusters. Let’s have a look at the different clusters on some UMAPs. I am only plotting them 4 at a time so that they are easier to see. Note: Clusters get progressively smaller so cluster 0 is always the biggest
# An easy way of seeing the size of each cluster
summary(seu$seurat_clusters)
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 844 633 582 563 523 477 440 417 412 298 255 245 228 175 161 137 135 112 71 69
## 20
## 53
# First four resolutions
Seurat::DimPlot(seu, group.by = c("RNA_snn_res.0.1","RNA_snn_res.0.2",
"RNA_snn_res.0.3","RNA_snn_res.0.4"),ncol=2)
# optionally save the UMAP plots using ggsave
#ggsave("find_clusters_res0.1to0.4_umap.png",width=20,height=20,units="cm")
# Last four resolutions but this time let's put the labels on the clusters
# and let's remove that annoying legend
Seurat::DimPlot(seu, group.by = c("RNA_snn_res.0.5","RNA_snn_res.0.6",
"RNA_snn_res.0.7","RNA_snn_res.0.8"),
ncol=2, label=T, combine=T) & NoLegend()
The lowest clustering resolution seems to “under-cluster” as it just identifies the major groups on the UMAP. The clusters don’t change much after about resolution 0.4.
On the earlier UMAPs, some of you may have noticed that there was separation based on sample. In some cases (eg disease vs control) this might be expected but these are three control samples so we would not expect there to be big differences.
DimPlot(seu, reduction = "umap",group.by = "orig.ident")
We know that there were more erythroid cells in PBMMC-2, but there is separation in other areas of the UMAP as well. Particularly at the bottom of the UMAP where there was high expression of the T cell genes CD3D and CD3E. Let’s use a stacked barplot to look at the proportion of cells within each cluster that come from each sample
# stacked barplot
ggplot(seu[[]], aes(fill=orig.ident,x=RNA_snn_res.0.3))+
geom_bar(position="fill") +
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0.5)) +
labs(fill="Sample")
#ggsave("louvain_samples_percluster_res0.3.png",width=15,height=10,units="cm")
# umap of the same cluster resolution
DimPlot(seu, reduction = "umap",group.by = "RNA_snn_res.0.3",label = T)
#ggsave("umap_louvain_res0.3.png",width=15,height=10,units="cm")
Focusing on the bottom group of cells; Cluster 2 is mostly sample 3, cluster 4 is mostly sample 2, and cluster 8 is mostly sample 1. To try to get the samples to cluster together we can try integration.
There are many batch correction/integration methods available including some that are built into Seurat. As a demonstration of using an external package, we are going to use Harmony instead. Harmony is a fast and widely used (over 1000 citations) method for batch correction/integration. It requires you to have run the preprocessing on a combined object of all samples (as we have done here already) rather than for each sample one at a time. For further information you can refer to https://portals.broadinstitute.org/harmony/SeuratV3.html. For information about the Seurat integration methods, refer to the Seurat website.
# Install harmony if you haven't already - install.packages("harmony")
library(harmony)
## Loading required package: Rcpp
# always set.seed before running harmony otherwise the results will change each time
# it doesn't have a seed argument included like Seurat functions
set.seed(123)
# just ignore the warning about the invalid name
seu <- RunHarmony(seu, group.by.vars = c("orig.ident"),
reduction = "pca",reduction.save = "harmony")
## Harmony 1/10
## Harmony 2/10
## Harmony 3/10
## Harmony 4/10
## Harmony 5/10
## Harmony 6/10
## Harmony 7/10
## Harmony converged after 7 iterations
## Warning: Invalid name supplied, making object name syntactically valid. New
## object name is Seurat..ProjectDim.RNA.harmony; see ?make.names for more details
## on syntax validity
# we can run an ElbowPlot again to look at the harmony components
ElbowPlot(seu,reduction = "harmony",ndims = 50)
# looks similar to the PCA elbowplot so will go with 25 again
# let's compare the grouping on the harmony components and principal components
harmony1and2 <- DimPlot(seu,reduction="harmony",group.by = "orig.ident")
pca1and2 <- DimPlot(seu,reduction="pca",group.by = "orig.ident")
# plot them side by side
pca1and2 + harmony1and2
Firstly, it is a good idea to save the Seurat object before running RunUMAP as the functions will overwrite the results from the unintegrated data. Here I am making a new Seurat object called seu_noIntegrate that contains the unintegrated results before proceeding.
seu_noIntegrate <- seu
seu <- RunUMAP(seu, reduction = "harmony", dims = 1:25)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
seu <- FindNeighbors(object = seu, reduction = "harmony",dims = 1:25)
seu <- FindClusters(seu, resolution = seq(0.1, 0.8, by=0.1))
This loads the clusters and integration results that I had generated previously. RunHarmony and RunUMAP can yield slightly different results when run repeatedly on the same data.
load("seu_clusters_integrated_cell_assign.RData")
Seurat::DimPlot(seu,reduction = "umap",
group.by = "RNA_snn_res.0.3")
# Optionally save the graph as a png file
#ggsave("harmony_cluster_0.3.png",width=15,height=15,units="cm")
Seurat::DimPlot(seu_noIntegrate,reduction = "umap",
group.by = "RNA_snn_res.0.3")
#ggsave("unintegrated_cluster_0.3.png",width=15,height=15,units="cm")
Seurat::DimPlot(seu,reduction = "umap",
split.by = c("orig.ident"),group.by = "RNA_snn_res.0.3")
#ggsave("harmony_cluster_0.3_splitby_sample.png",width=20,height=10,units="cm")
Seurat::DimPlot(seu_noIntegrate,reduction = "umap",
split.by = c("orig.ident"),group.by = "RNA_snn_res.0.3")
#ggsave("unintegrated_cluster_0.3_splitby_sample.png",width=20,height=10,units="cm")
# Stacked barplot of the integrated clusters
# There are now fewer sample-specific clusters
ggplot(seu[[]], aes(fill=orig.ident,x=RNA_snn_res.0.3))+
geom_bar(position="fill") +
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0.5)) +
labs(fill="Sample")
#ggsave("harmony_integrated_barplot_res0.3.png",width=10,height=10,units="cm")
#################################################################################
The first step in cell type assignment is to decide on the cluster resolution that you want to use. Once you have decided on a cluster resolution, then you perform steps to identify the cell type each cluster corresponds to. Depending on your dataset you might want to do a “coarse” clustering where you just separate into cell types as well as a finer clustering into cell subtypes.
# The lower cluster resolutions looked better on the original data
Seurat::DimPlot(seu, group.by = c("RNA_snn_res.0.2","RNA_snn_res.0.3",
"RNA_snn_res.0.4","RNA_snn_res.0.5"),label = T)
Some of the clusterings are very similar and your clusters may look slightly different to what is here. I am going to use 0.3 for now (you can always change your mind later). Let’s set our active.ident as the 0.3 clustering
# the lower resolution at 0.3 looks reasonably good - not too many overlapping clusters
seu <- Seurat::SetIdent(seu, value = seu$RNA_snn_res.0.3)
# Let's plot the UMAP with labels
Seurat::DimPlot(seu, group.by = c("RNA_snn_res.0.3"),label = T)
We have talked a lot about erythroid cells, so let’s plot some markers of other cell types instead. We are saving the marker genes as vectors. Feel free to add marker genes for other cell types.
# Define our cell marker lists
tcell_genes <- c("IL7R", "LTB", "TRAC", "CD3D")
monocyte_genes <- c("CD14", "CST3", "CD68", "CTSS")
ncol controls how many columns there are in the output when you are making multiple plots at the same time. By default, the normalised counts are plotted
Seurat::FeaturePlot(seu, features=tcell_genes, ncol=2)
# Looks like clusters 0 and 10 are highest for the T cell markers on the UMAP
# Let's confirm that with a violin plot instead
Seurat::VlnPlot(seu,group.by = ("RNA_snn_res.0.3"),
features = tcell_genes,
ncol = 2)
Sometimes protein-level cell markers are not highly expressed in RNA-seq data. Sometimes, commonly used cell markers aren’t actually as specific to a particular cell type as people would like them to be as well. If there is no universally recognised gold-standard marker for a cell type, then it is always a good idea to plot multiple markers. In the plot above, LTB is up in multiple clusters but if you look across all genes then it is clear that the genes are only consistently higher in clusters 0 and 10.
# Now let's do the same for the monocyte genes starting with a UMAP
Seurat::FeaturePlot(seu, features=monocyte_genes, ncol=2)
# Let's check with a violin plot as well
Seurat::VlnPlot(seu,
features = monocyte_genes,
ncol = 2)
# Cluster 2 shows consistently high expression with 7 high for some markers
Gene names can differ between species (eg mouse and human) so make sure that you are using the correct name. R may return an error saying gene not found when in reality you just aren’t using the right gene name for your target species. Similarly, publications will often refer to protein names which can often differ from the name of the gene that encodes the protein (eg NK1.1 comes from the Klrb1c gene in mice). NCBI gene is a good place to check to confirm that you are using the correct gene name https://www.ncbi.nlm.nih.gov/gene.
An alternative approach to plotting multiple genes is to use AddModuleScore to make a combined score from a list of marker genes. The CellCycleScoring function that we used earlier actually uses the AddModuleScore function to summarise the expression of the cell cycle marker genes. Here we will use it for the T cell and monocyte marker genes.
# Note that AddModuleScore expects a list. If you just put in tcell_genes instead
# of list(tcell_genes) then it would run AddModuleScore on each gene individually
seu <- Seurat::AddModuleScore(seu,
features = list(tcell_genes),
name = "tcell_genes")
Results are stored in the metadata. Note that AddModuleScore added a 1 on the end of the specified name (ie tcell_genes1 instead of tcell_genes). This is useful when features include multiple genesets. Now let’s plot the results.
# UMAP of tcell_genes1
Seurat::FeaturePlot(seu, features="tcell_genes1")
# VlnPlot of tcell_genes1
Seurat::VlnPlot(seu, features="tcell_genes1")
AddModuleScore is fast, simple, and produces similar results to available alternatives. One of the issues that I have with AddModuleScore is that it returns negative values which can be misinterpreted as downregulation. AddModuleScore also generates the score using the average expression of other randomly selected genes (the number of genes is specified by the ctrl argument to AddModuleScore - default value is 100). This is fine most of the time but can be problematic if you subset your Seurat object and then rerun AddModuleScore again. There is an alternative package called UCell that produces a similar score, is more reproducible, and returns a score between 0 and 1. See appendix for details.
The FindAllMarkers function sequentially runs differential expression between each cluster and all other cells (ie cluster 0 vs cluster1-13 combined). The idea is to find the genes that are highest in each cluster compared to other clusters. A column by column explanation of the output object (de_genes) is in the appendix.
de_genes <- Seurat::FindAllMarkers(seu, min.pct = 0.25, ident="RNA_snn_res.0.3",
only.pos = TRUE)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
# Let's examine the output
head(de_genes)
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## IFITM1 0 2.179555 0.766 0.193 0 0 IFITM1
## CD3D 0 2.162188 0.813 0.214 0 0 CD3D
## TRAC 0 2.119884 0.710 0.178 0 0 TRAC
## CD3E 0 1.939660 0.670 0.124 0 0 CD3E
## LTB 0 1.909197 0.824 0.375 0 0 LTB
## JUNB 0 1.860133 0.892 0.649 0 0 JUNB
# Now let's subset the object to only keep the genes below an adjusted p-value of 0.05
de_genes <- subset(de_genes, de_genes$p_val_adj < 0.05)
# Another way of subsetting using the %in% operator
de_genes[de_genes$gene %in% tcell_genes,]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## CD3D 0.000000e+00 2.1621879 0.813 0.214 0.000000e+00 0 CD3D
## TRAC 0.000000e+00 2.1198845 0.710 0.178 0.000000e+00 0 TRAC
## LTB 0.000000e+00 1.9091972 0.824 0.375 0.000000e+00 0 LTB
## IL7R 1.448832e-295 1.6810492 0.487 0.100 2.705405e-291 0 IL7R
## LTB1 5.390164e-35 1.1532332 0.671 0.465 1.006505e-30 6 LTB
## LTB2 7.063843e-24 0.8327548 0.793 0.468 1.319031e-19 9 LTB
## CD3D1 5.352733e-39 1.4531134 0.771 0.341 9.995157e-35 10 CD3D
## TRAC1 8.990548e-36 1.4475279 0.700 0.289 1.678805e-31 10 TRAC
## IL7R1 2.668579e-10 0.8085502 0.386 0.183 4.983038e-06 10 IL7R
All 4 markers came up for both cluster 0 and 10 ### Saving the output - Be careful with Excel It is generally a good idea to save the output of FindAllMarkers as you may want to run further analysis on the output (eg enrichment) and it can be time consuming to run. A safe way of saving the output is as an RData object but you may also want to write the output to a table (txt or csv) for showing your results to supervisors/collaborators. However, Excel has an unfixable habit of converting some gene names (eg MARCH1) to dates. If you save your result as a csv, open it in Excel, and then read it back into R again, you will have dates in your gene column.
write.csv(de_genes,
file="de_genes_FindAllMarkers.csv",
row.names = F, quote = F)
save(de_genes,file="de_genes_FindAllMarkers.RData")
The manual assignment workflow above involves choosing your cluster resolution, and then identifying each cluster based on the expression levels of known markers. This approach is reliable, but it can be subjective, slow, and does require some expertise about which cell markers are the best. If multiple cell type markers are present in a cluster then it can be an indicator of doublets, particularly if expression is inconsistent across cells. I have included instructions on how you can add manual cluster assignments to the Seurat metadata in the Appendix.
A quicker method of performing cell type assignment without as much biology knowledge is to use one of the many available tools for cell type assignment. This is an example using two Bioconductor packages.
# BiocManager::install(c("celldex","SingleR"))
# Note that I have hidden the console output of loading the packages
library(celldex)
library(SingleR)
singleR has multiple reference datasets. We will use a dataset of Hematopoietic cells.
# load the dataset as ref
ref <- celldex::NovershternHematopoieticData()
## snapshotDate(): 2022-10-31
## see ?celldex and browseVignettes('celldex') for documentation
## loading from cache
## see ?celldex and browseVignettes('celldex') for documentation
## loading from cache
class(ref)
## [1] "SummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"
# show how many marker genes there are for each cell type
table(ref$label.main)
##
## B cells Basophils CD4+ T cells CD8+ T cells CMPs
## 29 6 21 24 4
## Dendritic cells Eosinophils Erythroid cells GMPs Granulocytes
## 10 5 33 4 13
## HSCs Megakaryocytes MEPs Monocytes NK cells
## 14 12 9 9 14
## NK T cells
## 4
SingleR compares our data to the reference to work out the likely annotations. Since SingleR is a Bioconductor package and we have a Seurat object rather than a SingleCellExperiment, we have to extract our normalised counts using the GetAssayData function (remember slot=data for normalised counts, slot=counts for unnormalised counts).
seu_SingleR <- SingleR::SingleR(test = Seurat::GetAssayData(seu, slot = "data"),
ref = ref,
labels = ref$label.main)
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
# Quick check of the output
head(seu_SingleR)
## DataFrame with 6 rows and 4 columns
## scores labels
## <matrix> <character>
## PBMMC-1_AAACCTGCAGACGCAA-1 0.216202:0.197296:0.086435:... B cells
## PBMMC-1_AAACCTGTCATCACCC-1 0.143005:0.129582:0.170521:... CD8+ T cells
## PBMMC-1_AAAGATGCATAAAGGT-1 0.113423:0.196264:0.111341:... Monocytes
## PBMMC-1_AAAGCAAAGCAGCGTA-1 0.166749:0.168504:0.239303:... CD4+ T cells
## PBMMC-1_AAAGCAACAATAACGA-1 0.102549:0.103979:0.178762:... CD8+ T cells
## PBMMC-1_AAAGCAACATCAGTCA-1 0.181526:0.147693:0.115841:... Erythroid cells
## delta.next pruned.labels
## <numeric> <character>
## PBMMC-1_AAACCTGCAGACGCAA-1 0.1471065 B cells
## PBMMC-1_AAACCTGTCATCACCC-1 0.0753696 CD8+ T cells
## PBMMC-1_AAAGATGCATAAAGGT-1 0.1274524 Monocytes
## PBMMC-1_AAAGCAAAGCAGCGTA-1 0.0845267 CD4+ T cells
## PBMMC-1_AAAGCAACAATAACGA-1 0.0607683 CD8+ T cells
## PBMMC-1_AAAGCAACATCAGTCA-1 0.1057135 Erythroid cells
# Plot a heatmap using the SingleR plotting function
SingleR::plotScoreHeatmap(seu_SingleR)
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
Some cell types are clear but there are also a lot of overlaps (eg basophils and eosinophils). Some of the cell types look like there aren’t many cells in them at all.
#
# Some clusters look very small, let's extract the SingleR annotation as a vector
# and then make a table of the results
SingleR_annot <- seu_SingleR$labels
t <- table(SingleR_annot)
t
## SingleR_annot
## B cells Basophils CD4+ T cells CD8+ T cells CMPs
## 775 180 912 706 116
## Dendritic cells Eosinophils Erythroid cells GMPs Granulocytes
## 130 56 2206 14 494
## HSCs Megakaryocytes MEPs Monocytes NK cells
## 283 1 26 907 13
## NK T cells
## 11
We can’t do much with tiny clusters so we are going to arbitrarily remove the celltypes that have less than 10 cells
# Make a vector of names cell types with less than 10 cells and then change it
# to unassigned
other <- names(t)[t < 10]
SingleR_annot[SingleR_annot %in% other] <- "unassigned"
# And finally, let's add the labels into the metadata of the Seurat object
seu <- AddMetaData(seu, metadata= SingleR_annot,col.name="SingleR_annot")
Note that SingleR didn’t actually use our clustering. It purely classified the cells into the groups.
# plot the clusters and the annotation
DimPlot(seu,reduction = "umap",group.by = c("SingleR_annot","RNA_snn_res.0.3"),label = T,label.size = 3)
There is a lot of mixing within some clusters. It seems to have worked better at separating T cells and erythroid cells between clusters 0 and 1.
# Cell types within each sample
ggplot(seu[[]], aes(fill=SingleR_annot,x=orig.ident))+
geom_bar(position="fill") +
theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0.5)) +
labs(fill="Cell type")
Now let’s have a look at the breakdown of celltypes per cluster. This time we will use dittoseq as it is designed to choose more distinct colours.
dittoSeq::dittoBarPlot(seu, var = "SingleR_annot", group.by = "RNA_snn_res.0.3")
# save the seurat object with clusters
save(seu,file="seu_clusters_integrated_cell_assign.RData")
#########################################################################################
#########################################################################################
The Leiden algorithm will run more efficiently with igraph installed as well. Install leidenalg using either;
# pip install leidenalg igraph
# or
# conda install -c conda-forge leidenalg python-igraph
The Leiden algorithm can be specified using algorithm=4. The more efficient “igraph” method can be specified using the method argument
seu <- Seurat::FindClusters(seu, resolution = seq(0.1, 0.8, by=0.1),
algorithm = 4,method = "igraph")
Install UCell from Bioconductor and then load the library.
# BiocManager::install("UCell")
library(UCell)
seu <- AddModuleScore_UCell(seu, features = list(tcell_genes=tcell_genes))
# look at the names of metadata
names(seu[[]])
## [1] "orig.ident" "nCount_RNA" "nFeature_RNA"
## [4] "percent.mito" "percent.ribo" "percent.globin"
## [7] "ident" "sum" "detected"
## [10] "percent.top_50" "percent.top_100" "percent.top_200"
## [13] "percent.top_500" "total" "S.Score"
## [16] "G2M.Score" "Phase" "RNA_snn_res.0.1"
## [19] "RNA_snn_res.0.2" "RNA_snn_res.0.3" "RNA_snn_res.0.4"
## [22] "RNA_snn_res.0.5" "RNA_snn_res.0.6" "RNA_snn_res.0.7"
## [25] "RNA_snn_res.0.8" "seurat_clusters" "tcell_genes1"
## [28] "SingleR_annot" "tcell_genes_UCell"
# the variable name for the UCell score is ends with _UCell
# UMAP of tcell_genes1
Seurat::FeaturePlot(seu, features="tcell_genes_UCell")
# VlnPlot of tcell_genes1
Seurat::VlnPlot(seu, features="tcell_genes_UCell")
The message from the violin plot is the same as AddModuleScore, but I like that it returns a score between 0 and 1
de_genes_top10 <- de_genes %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
# scale before running the heatmap RNA_snn_res.0.3
library(viridis)
## Loading required package: viridisLite
# Heatmaps will only work on the genes that are in the scale.data slot
seu <- ScaleData(seu,features=de_genes_top10$gene)
## Centering and scaling data matrix
# I've removed the legend with NoLegend and taken away the y-axis as otherwise
# there would be over 100 gene names down the side
DoHeatmap(seu, features=de_genes_top10$gene, group.by="RNA_snn_res.0.3", label=T) +
scale_fill_gradientn(colors=viridis(256)) + NoLegend() + theme(axis.text.y = element_blank())
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
Some clusters have very similar expression - eg 4 and 5. There are also some cells in cluster 1 that seem to have more similar expression to cells in cluster 0. This might be due to us only plotting the top 10 genes from each cluster
# Initialise a column
seu$manual_celltype <- rep("other")
# Assign a value to the manual celltype column based on the cluster
seu$manual_celltype[seu$RNA_snn_res.0.3==0] <- "T_cell"
seu$manual_celltype[seu$RNA_snn_res.0.3==10] <- "T_cell"
seu$manual_celltype[seu$RNA_snn_res.0.3==1] <- "Erythroid"
seu$manual_celltype[seu$RNA_snn_res.0.3==5] <- "Erythroid"
seu$manual_celltype[seu$RNA_snn_res.0.3==3] <- "Erythroid"
seu$manual_celltype[seu$RNA_snn_res.0.3==11] <- "Erythroid"
seu$manual_celltype[seu$RNA_snn_res.0.3==2] <- "Monocyte"
DimPlot(seu,reduction = "umap",group.by = c("SingleR_annot","manual_celltype"),label = T,ncol=1,label.size = 3)
Comments
There seems to be better integration following harmony, particularly on the left side of the plot. As expected, there are still areas that are mostly PBMMC-2 but that is because that sample had more erythroid cells. This is actually a good indication that there hasn’t been “over-correction” of the batch effect.